Add CUDA Target, Runtime, and Kernel CI Support by sunnycase · Pull Request #1 · sunnycase/nncase

sunnycase · 2026-06-25T04:28:45Z

Summary

This PR adds end-to-end CUDA support for the nncase NTT path and native runtime. It introduces a CUDA target/module compiler, CUDA runtime module loading and execution support, CUDA-aware NTT runtime primitives, CUDA kernel tests, and a dedicated Linux CUDA CI job for running those tests separately from the regular CPU/macOS compiler jobs.

Motivation

- Enable nncase to compile NTT-generated kernels for CUDA and execute them through the native runtime, instead of stopping at code generation.
- Keep CPU and CUDA generated modules on compatible runtime/operator ABI boundaries while allowing CUDA-specific device entry points and launch behavior.
- Catch CUDA-specific regressions in CI without making ordinary Linux/macOS compiler jobs depend on GPU availability.
- Fix issues uncovered while enabling CUDA tests, including CUDA toolkit discovery, device-callable scalar helpers, generated module ABI handling, and reduce-axis normalization in the NTT vectorization/lowering path.
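
The "device-callable scalar helpers" item above usually comes down to annotating small scalar conversions and operators so that the same header compiles for host and for CUDA device code. The sketch below is not taken from this PR: the macro name `SKETCH_HOST_DEVICE` and the type `bf16_sketch` are hypothetical, and a truncating bfloat16 stands in for `half` only to keep the example short.

```cpp
#include <cstdint>
#include <string.h>

// Hypothetical macro: expands to CUDA qualifiers only under a CUDA compiler,
// so the same header still builds as plain C++ for the CPU target.
#if defined(__CUDACC__) || defined(__CUDA__)
#define SKETCH_HOST_DEVICE __host__ __device__
#else
#define SKETCH_HOST_DEVICE
#endif

// Hypothetical 16-bit scalar that stores the upper half of an IEEE-754 float.
struct bf16_sketch {
    uint16_t bits;
};

// Convert float -> bf16 by truncation (no rounding, for brevity).
SKETCH_HOST_DEVICE inline bf16_sketch from_float(float v) {
    uint32_t u;
    memcpy(&u, &v, sizeof(u));
    return bf16_sketch{static_cast<uint16_t>(u >> 16)};
}

// Convert bf16 -> float by placing the stored bits in the upper half.
SKETCH_HOST_DEVICE inline float to_float(bf16_sketch h) {
    uint32_t u = static_cast<uint32_t>(h.bits) << 16;
    float v;
    memcpy(&v, &u, sizeof(v));
    return v;
}

// Arithmetic is done in float and converted back, on host and device alike.
SKETCH_HOST_DEVICE inline bf16_sketch operator+(bf16_sketch a, bf16_sketch b) {
    return from_float(to_float(a) + to_float(b));
}
```

Without the qualifiers, a call such as `a + b` from inside a kernel would be rejected by the CUDA compiler as a host-only function call, which is the class of failure this kind of helper is meant to avoid.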

Implementation

- Added CUDA target plumbing in Nncase.Modules.NTT, including CUDATarget, CUDAModuleCompiler, target abstraction cleanup, and CUDA-aware C/CMake generation.
- Added native CUDA runtime support with CUDA runtime module/function classes, loader integration, runtime CMake wiring, and ENABLE_CUDA_RUNTIME gating.
- Added NTT CUDA runtime support for topology, remote tensors, distributed operations, vector ops, profiling, and CUDA runtime entry points.
- Updated NTT kernels and runtime utilities so generated code can compile for both CPU and CUDA, including device-callable scalar conversions/operators for half and related scalar types.
- Normalized negative reduce axes during NTT vectorization/lowering instead of IR construction, preserving IR semantics while fixing CUDA reduce vectorization cases.
- Added CUDA kernel test coverage through UnitTestCUDAKernels and enabled it in CI with a dedicated test-x86_64-linux-cuda job.
- Kept the regular compiler test job excluding UnitTestCUDAKernels so CPU-only Linux/macOS jobs remain independent from CUDA runtime availability.
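
The reduce-axis normalization mentioned above can be illustrated with a minimal sketch. It assumes the usual convention that a negative axis counts from the last dimension, so axis `-1` on a rank-4 tensor means axis `3`. The function name `normalize_axes` is hypothetical and is not the code in this PR, which performs the step inside the NTT vectorization/lowering pass.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: maps every axis into the range [0, rank).
// Negative axes count from the end; out-of-range axes are rejected.
std::vector<int64_t> normalize_axes(std::vector<int64_t> axes, int64_t rank) {
    for (auto &axis : axes) {
        if (axis < -rank || axis >= rank) {
            throw std::out_of_range("reduce axis out of range");
        }
        if (axis < 0) {
            axis += rank; // e.g. -1 with rank 4 becomes 3
        }
    }
    return axes;
}
```

Doing this at lowering time rather than when the IR node is built leaves the original (possibly negative) axes visible in the IR, while code that indexes into shapes or vectorized lanes only ever sees non-negative values.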

Validation

- git diff --check
- YAML parsing for .github/workflows/compiler-build.yml
- dotnet build modules/Nncase.Modules.NTT/Nncase.Modules.NTT.csproj -c Release --no-restore
- Rebuilt the native runtime locally with clang, CUDA 12.8, and ENABLE_CUDA_RUNTIME=ON.
- Verified the installed native runtime exposes CUDA runtime support and links against CUDA runtime libraries.
- Verified generated CUDA modules compile locally with clang++ and CUDA 12.8 for the reduce/vectorization repro cases.
- dotnet test src/Nncase.Tests/Nncase.Tests.csproj -c Release --no-build --no-restore --filter "FullyQualifiedName~Nncase.Tests.TargetTest.UnitTestCUDAKernels.TestVectorizeReduce": 8/8 passed locally.
- Existing PR compiler/runtime/code-format checks passed before enabling the dedicated CUDA CI job; the latest CI run is rerunning with the new CUDA check included.

Limitations

- Full UnitTestCUDAKernels execution requires a Linux runner with an NVIDIA GPU, CUDA toolkit, nvcc, clang/clang++, and the labels self-hosted, linux, x64, cuda. GitHub-hosted CPU runners cannot execute these runtime tests.
- The dedicated CUDA CI job currently uses CUDA architecture 120, matching the local validation environment.
- The general compiler test job still intentionally excludes UnitTestCUDAKernels; CUDA runtime tests are expected to run only in the dedicated CUDA job.
- HuggingFace importer tests still depend on an available HF_HOME cache/configuration in CI and should be revisited separately from the CUDA runtime work.

Backlog

Future CUDA follow-up work is tracked in #2.

github-actions · 2026-06-29T04:06:04Z

Test Results

3,318 tests: 3,318 ✅ passed, 0 💤 skipped, 0 ❌ failed
5 suites, 5 files
Duration: 1h 54m 25s ⏱️

Results for commit 49f5205.

♻️ This comment has been updated with latest results.


sunnycase and others added 8 commits June 25, 2026 04:26

Add cuda test

784210d

Initial cuda support

9df6355

Add cuda target

df859d9

Add cuda runtime module

a515486

Update

5dac66e

Add warp hierarchy

2570775

Remove trailing whitespace

f8ab72e

Apply code-format changes

11bc83a

sunnycase closed this Jun 25, 2026

sunnycase reopened this Jun 25, 2026

sunnycase added 6 commits June 25, 2026 04:37

Gate CUDA runtime build

8961ef0

Fix half fallback type

926ca8b

Exclude CUDA tests from compiler CI

d603370

Align CPU runtime thread entry ABI

f68bbc2

Fix scoped local rdata serialization

114276c

Fix compiler CI profiling and test result permissions

b010247

sunnycase added 10 commits June 29, 2026 04:17

Fix macOS NTT gencode CI failure

8577f98

Skip uncached HuggingFace tests in PR CI

8f5b850

Fix NTT CI test target handling

bd57399

Stabilize ONNX ReduceL1 CI test

8e86755

Fix NTT nested function local buffer ABI

4ccc168

Keep device function calls on operator ABI

f0dcc31

Normalize reduce axes in IR construction

9f91b69

Make half operations callable from CUDA device code

bc13b56

Normalize reduce axes in NTT vectorization

356735a

Find CUDA toolkit when enabling CUDA runtime

1677f37

sunnycase marked this pull request as ready for review July 1, 2026 06:20

Add CUDA kernel CI job

49f5205

sunnycase changed the title ~~[codex] Add CUDA support~~ Add CUDA Target, Runtime, and Kernel CI Support Jul 2, 2026

sunnycase mentioned this pull request Jul 2, 2026

Backlog: CUDA follow-up work #2

Open

4 tasks

sunnycase merged commit ba3b7d6 into master Jul 2, 2026
25 of 27 checks passed

sunnycase deleted the feature/cuda branch July 2, 2026 02:50

sunnycase merged 25 commits into master from feature/cuda
